feat: gather-info hardening + vm-troubleshooting-dashboard (v3 schema) #3
Merged
…collectors
Bug fixes:
- Detect DCGM daemon not running and prompt to enable it (exit 235 fix)
- Add missing ibstatus/ibv_devinfo parser hints that caused phantom errors
- Fix nvidia facts defaulting to empty string instead of "unavailable"
- Fix omitempty silently dropping zero-value command state fields in manifest
- Fix docker gate logic to distinguish permission/daemon/timeout failures
- Fix critical_event_count reporting capped count instead of true total
Reliability improvements:
- Single-pass artifact checking in all triage analyzers with firstPayloadLine
- Panic recovery around each triage analyzer
- Rewrite archive.go with WalkDir, named returns, and errors.Join
- Reuse docker info output for gate check instead of running twice
- Extract saveDirConcat helper to deduplicate network/packages collectors
- Add SanitizePathComponent for safe artifact paths from container names
- Pointer fields in report.go for explicit zero emission
New tests:
- Contract tests ensuring Go vocabularies stay in sync with JSON schemas
- Artifact state tests for firstPayloadLine and checkArtifact
- Triage determinism tests for stable output ordering
- Xid analyzer unit tests
Other:
- Add triage-result.schema.json for triage _data/*.json files
- Narrow lspci GPU detection to 3D/VGA/Display PCI classes
- Remove unused UI screenshots (non-selected.png, selected.png)
- Update documentation to reflect current architecture
Add 3-tier adaptive banner: full BigText (84+ cols), stacked BigText (43-83 cols), compact styled text (<43 cols). Increase two-column info box threshold so it collapses to single-column at moderate widths instead of rendering cramped values.
Add required issue/finding codes, confidence, and stable fingerprints across collector, triage, runner, manifest, report, schemas, and tests to provide machine-stable identity and recurrence tracking.
Made-with: Cursor
Add bounded structured journal ingestion and fallback-aware triage parsing, then split Xid catalog parsing from local policy while making critical pattern identity explicit for deterministic, stable findings and aligned docs.
Made-with: Cursor
…iqueness, and report completeness
- Write placeholder files for skipped artifacts so skip_reasons never reference paths missing from the archive (common.go saveSkippedArtifact)
- Fix critical event fingerprints: include log line content in the hash so distinct findings produce unique fingerprints for the UI
- Add skip_count and error_count to report.ndjson collector_summary records
- Fix journalctl --until to use local time format instead of UTC/RFC3339
- Rename nowUTC → nowFunc to reflect local-time semantics
- Add verbose logging for structured journal probe paths
- Strip git tag namespace prefix from version display in info box
- Bump version to 0.2.0
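The fingerprint fix above can be sketched as follows. This is an assumption about the shape of the hash input, not the collector's actual code: the `fingerprint` function, the SHA-256 choice, the 16-character truncation, and the `CRIT_EXAMPLE` code are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint derives a stable identifier from a finding code plus the
// matched log line, so two different lines under one code do not collide.
func fingerprint(code, line string) string {
	h := sha256.New()
	h.Write([]byte(code))
	h.Write([]byte{0}) // separator: ("ab","c") and ("a","bc") must differ
	h.Write([]byte(line))
	return hex.EncodeToString(h.Sum(nil))[:16]
}

func main() {
	a := fingerprint("CRIT_EXAMPLE", "device A failed")
	b := fingerprint("CRIT_EXAMPLE", "device B failed")
	fmt.Println(a != b) // distinct lines yield distinct fingerprints
}
```

Hashing only the code would give every finding of one pattern the same fingerprint, which is the collision the commit describes.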
Addresses all findings from the 3-machine production audit (5090-076,
5090-087, rtx6000pro). Key changes:
Privilege escalation: replace per-command sudo with a single re-exec
under sudo at startup via syscall.Exec, eliminating overhead from 48
individual sudo invocations per run.
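A minimal sketch of the single re-exec, assuming a Unix host and that the binary simply replaces itself with `sudo -- <self> <args>` when it is not already root. The helper names and the example path and flag are hypothetical; the real startup logic may differ.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// sudoArgv builds the argument vector for re-running this program under
// sudo. By convention argv[0] is the name of the program being executed.
func sudoArgv(self string, args []string) []string {
	out := []string{"sudo", "--", self}
	return append(out, args...)
}

// reexecUnderSudo replaces the current process image with sudo.
// On success syscall.Exec never returns; nil is returned only when the
// process is already root and no re-exec is needed.
func reexecUnderSudo() error {
	if os.Geteuid() == 0 {
		return nil
	}
	sudoPath, err := exec.LookPath("sudo")
	if err != nil {
		return fmt.Errorf("sudo not found: %w", err)
	}
	self, err := os.Executable()
	if err != nil {
		return fmt.Errorf("resolve executable: %w", err)
	}
	return syscall.Exec(sudoPath, sudoArgv(self, os.Args[1:]), os.Environ())
}

func main() {
	// Only print the argv here; calling reexecUnderSudo would replace
	// this process.
	fmt.Println(sudoArgv("/usr/local/bin/gather-info", []string{"--verbose"}))
}
```

Because the whole process runs as root after the re-exec, individual collectors can invoke their commands directly instead of wrapping each one in sudo.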
Triage hardening:
- Split "Kernel Panic" pattern into separate Kernel Panic (CRITICAL),
Kernel BUG (WARNING), and Kernel Stack Trace (INFO) findings
- Add source-aware pattern matching so kernel-only patterns skip
non-kernel journal_errors lines (prevents false positives)
- Add NVIDIA RPC Failure, NVIDIA Driver Assertion, and Segfault patterns
- Add hex address normalization for segfault dedup
- Detect NDJSON truncation and emit FindingDataQuality warnings
(survives zero-event early returns in both critical.go and xid.go)
- Bump schema version to 3.1.0
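The hex address normalization for segfault dedup can be sketched like this. The regex, the `<addr>` placeholder, and the sample lines are illustrative assumptions, not the triage code itself.

```go
package main

import (
	"fmt"
	"regexp"
)

// hexAddr matches runs of 8 or more hex digits, with or without a 0x
// prefix. Assumption: shorter hex runs (sizes, error codes) are left alone.
var hexAddr = regexp.MustCompile(`\b(?:0x)?[0-9a-fA-F]{8,}\b`)

// normalizeAddrs replaces address-like tokens with a fixed placeholder so
// that crashes differing only in addresses collapse to one key.
func normalizeAddrs(line string) string {
	return hexAddr.ReplaceAllString(line, "<addr>")
}

func main() {
	a := normalizeAddrs("app[101]: segfault at 7f3a2c001000 ip 00007f3a2b8e4f10 error 4")
	b := normalizeAddrs("app[101]: segfault at 7f9b10442000 ip 00007f9b0fd21f10 error 4")
	fmt.Println(a == b)
	fmt.Println(a)
}
```

A pattern this broad also rewrites long decimal-only numbers, so a real implementation would likely scope it to segfault lines only.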
Collector fixes:
- Parse GPU CSV per-line to handle nvidia-smi error lines mixed with
valid rows (partial failure when GPU fallen off bus)
- Handle OOM probe timeout separately from exit errors
- Fix fabricmanager label ("not running" not "inactive")
- iptables uses IgnoreExit to avoid false warnings on nf_tables systems
- Strip ANSI escapes from nvidia-smi topo output (TERM=dumb + StripANSI)
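For the ANSI stripping item, a minimal sketch assuming only CSI sequences (ESC followed by `[`) need removing. `stripANSI` here is a hypothetical stand-in for the collector's StripANSI, and the sample string is illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// csi matches ANSI CSI sequences: ESC '[', parameter bytes, optional
// intermediate bytes, and one final byte in the range '@' to '~'.
var csi = regexp.MustCompile(`\x1b\[[0-9;?]*[ -/]*[@-~]`)

// stripANSI removes CSI escape sequences and leaves all other bytes intact.
func stripANSI(s string) string {
	return csi.ReplaceAllString(s, "")
}

func main() {
	fmt.Printf("%q\n", stripANSI("\x1b[4mGPU0\x1b[0m\tGPU1"))
}
```

Setting TERM=dumb asks the tool not to emit escapes in the first place; stripping afterwards covers output that still contains them.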
Also adds Xid catalog codegen tooling and Makefile updates.
Add __pycache__/ to gitignore. Replace stringly-typed path check for ANSI stripping with a struct field in nvidia collector.
The Makefile was writing a dangling xid.md to the repo root on every build. Make the flag optional and drop it from the Makefile so only the two Go source files needed for compilation are generated.
…tignore
- Dashboard: Go API, SQLite ingest/store, pathutil, React+Vite UI, mirrored schemas
- Collector: journal NDJSON streaming, Docker/sanitize, triage enrichment+evidence, runner/output/schema updates
- Repo: monorepo AGENTS/CLAUDE/README, ARCHITECTURE+SCHEMA-COMPATIBILITY, docs index, .gitignore refresh
…nrichment
Require identity headers when using trust-forwarded auth; cap and evict upload rate limiters; symlink-safe SafeJoin; SaveBounded + ingest cap; ListPage with SQL pagination; nullable triage_finding_count (API + UI).
Journal NDJSON via bytes.Buffer; enrichment errors recorded on collector results; sanitize tests; triage/integration/version and CODEMAP updates.
… triage
Register EDAC, PCIe AER, IPMI, and thermal collectors; add shared sysfs/PCI helpers, saveJSONProbe, and isPhysicalPort for consistent device classification.
Network: skip resolvectl when systemd-resolved is absent; collect physical NIC error counters (sysfs, ethtool, devlink).
InfiniBand: gate IB-only tools on infiniband link_layer, keep rdma link on RoCE-only hosts, add perfquery.
Journal: boot/previous-boot views, bounded previous-boot errors, /var/crash listing with crash_dump_count and manifest integer-fact typing.
Triage: broaden critical kernel/hardware/disk/net patterns; strip dmesg-style timestamps in normalizeCriticalLine so journal vs dmesg fingerprints align.
Extend manifest/report schemas (parser hints, tags) for vm-troubleshooting and the dashboard copy; refresh AGENTS.md and CODEMAP.md accordingly.
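The timestamp stripping in the Triage item can be sketched as below. This covers only the dmesg-prefix step and assumes the usual `[seconds.micros]` prefix; the real normalizeCriticalLine applies more rules, and `stripDmesgStamp` plus the sample message are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// dmesgStamp matches a leading dmesg-style timestamp such as "[   12.345678] ".
var dmesgStamp = regexp.MustCompile(`^\[\s*\d+\.\d+\]\s*`)

// stripDmesgStamp removes the timestamp prefix so the same kernel message
// read from dmesg and from the journal normalizes to identical text.
func stripDmesgStamp(line string) string {
	return strings.TrimSpace(dmesgStamp.ReplaceAllString(line, ""))
}

func main() {
	fromDmesg := stripDmesgStamp("[   12.345678] example: device reset failed")
	fromJournal := stripDmesgStamp("example: device reset failed")
	fmt.Println(fromDmesg == fromJournal)
}
```

Once both sources produce the same normalized string, any hash computed over it yields the same fingerprint regardless of where the line was collected.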
Deployment & packaging
- Add Dockerfile (pnpm frontend build + CGO-free Go binary), docker-compose
(Caddy → oauth2-proxy → dashboard), DEPLOYMENT.md runbook, .dockerignore,
.env.example, Caddyfile.example, oauth2-proxy.cfg.example.
- Add cmd/dashboard entrypoint: data/web flags, archive cap, graceful
shutdown; require --auth-shared-token or --trust-forwarded-user on
non-loopback listen; tests for listenRequiresAuth (incl. IPv6 loopback).
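One plausible shape for the listen-address check, as a sketch only. The function name comes from the commit message, but the body is an assumption: it fails closed on anything that is not clearly a loopback address.

```go
package main

import (
	"fmt"
	"net"
)

// listenRequiresAuth reports whether a listen address is reachable from
// outside the host and therefore needs an auth mode configured.
// Assumed behaviour: unparseable addresses and empty hosts require auth.
func listenRequiresAuth(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return true // fail closed on malformed input
	}
	if host == "localhost" {
		return false
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return true // empty host (":8080") binds all interfaces
	}
	return !ip.IsLoopback()
}

func main() {
	for _, addr := range []string{"127.0.0.1:8080", "[::1]:8080", "0.0.0.0:8080", ":8080"} {
		fmt.Println(addr, listenRequiresAuth(addr))
	}
}
```

Using net.IP.IsLoopback handles both 127.0.0.0/8 and ::1, which is why the IPv6 case does not need a separate string comparison.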
Backend API & persistence
- API: GET /api/v1/archives/{id}/issue-state, POST .../issue-state/{fp} with
state ack|dismissed|; prefer X-Forwarded-Email over X-Forwarded-User for
uploaded_by when trusting proxies.
- SQLite: issue_state table; archives.uptime_seconds; collectors.skip_reasons_json;
extra indexes on issues (confidence, category, collector, code, fingerprint).
- Store: migrations, SetIssueState/LoadIssueStates, persist uptime and skip
reasons; list/get hydrate them.
- Ingest: attach collector SkipReasons; extract uptime from system/hypervisor
facts (uptime_seconds / system.uptime_*); uptime unit tests.
- Evidence suggestions: category-based path-prefix bonus (DISK/HW/NET/MEM/GPU);
category tests. Issue-state store tests.
Frontend — overview & issues
- New DashboardPage: severity KPI grid, composed overview copy, host context,
facts/signals, skips, collector grid, grouped issues preview (facts, boot,
grouping, title, summary, source, units libs + Vitest coverage).
- IssuesPage: URL-backed hideLowConf/grouped/hideDismissed; low-confidence
hidden-count banner; compareIssues default sort; pattern groups vs flat list;
confidence pills; IssueRowBody avoids duplicated title/message rendering.
- IssueDetailPage: ack/dismiss/clear, copy summary, confidence pill, pattern
siblings, occurrence/source callout, ranked prev/next + j/k shortcuts,
severity badge icons.
Frontend — artifacts & polish
- ArtifactBrowser: path filter, severity dots from related_artifact_paths, JSON
tree + system overview card, #L line hash highlight + share links.
- SeverityBadge: shared severity metadata + icons; KV long-value wrapping.
- Types: SkipReason, issue states, uptime_seconds; utils: sampleLine,
cleanTitle for primary finding titles.
- archives API: invalidate detail cache on re-upload; removeQueries on delete.
- issues API: issue-state hooks.
Tooling & docs
- package.json: vitest + Testing Library + jsdom; pnpm-lock.yaml updated.
- vitest.config.ts + vitest.setup.ts.
- .gitignore: /dashboard anchor, ignore live Caddy/oauth2-proxy configs.
- CODEMAP: link DEPLOYMENT.md and USER_GUIDE.md.
…ured component + chip breakdown
Under the "System Logs" category, dozens of runtime errors collapsed into an
indistinguishable list (WSL, dxgk, LSB, udev, NVIDIA, arbitrary daemons).
Users had no way to pivot. This change surfaces structure: every journal
row carries a component sub-label (SYSLOG_IDENTIFIER, authoritative), the
Source column renders as primary + muted secondary (two lines, not a
run-on string), and expanded groups get a chip-row breakdown users can
filter by. Shape before detail.
Backend (Go)
- IssueRecord.Component (optional, omitempty) + nullable `component` column on
issues table with an additive migration that also backfills existing
archives on startup (mirrors the established triage_finding_count pattern;
WHERE component IS NULL for re-runnability; per-archive error boundary so
a missing/corrupt tarball cannot block upgrade).
- New store/journal_component.go: extensible SourceHandler slice indexes
logs/journal_errors.ndjson by normalized message → SYSLOG_IDENTIFIER (falls
back to _SYSTEMD_UNIT minus .service). Majority-wins resolution across all
TriageFinding evidence lines; ties return empty (frontend fallback takes
over). Tactical duplication of normalizeCriticalLine from the collector
(two Go modules, internal/ boundary) — release-blocking TestNormalizeParity
guards drift; delete this file when the collector schema promotes the
identifier upstream.
- Ingest calls EnrichIssues in buildDetail(); LoadTriageMap exported so the
backfill and ingest paths share it.
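The majority-wins resolution with empty-on-tie can be sketched as follows. Function names are hypothetical and the input is assumed to be the identifiers already looked up for each evidence line; this is not the code in store/journal_component.go.

```go
package main

import (
	"fmt"
	"strings"
)

// identifierFor prefers SYSLOG_IDENTIFIER and falls back to the systemd
// unit name with its ".service" suffix removed.
func identifierFor(syslogID, unit string) string {
	if syslogID != "" {
		return syslogID
	}
	return strings.TrimSuffix(unit, ".service")
}

// majorityComponent returns the most frequent non-empty identifier, or ""
// when the top count is shared, so the caller can fall back to other rules.
func majorityComponent(ids []string) string {
	counts := make(map[string]int)
	for _, id := range ids {
		if id != "" {
			counts[id]++
		}
	}
	best, bestN, tied := "", 0, false
	for id, n := range counts {
		switch {
		case n > bestN:
			best, bestN, tied = id, n, false
		case n == bestN:
			tied = true
		}
	}
	if tied {
		return ""
	}
	return best
}

func main() {
	fmt.Println(majorityComponent([]string{"kernel", "kernel", "dockerd"}))
	fmt.Printf("%q\n", majorityComponent([]string{"kernel", "dockerd"}))
	fmt.Println(identifierFor("", "nvidia-persistenced.service"))
}
```

Returning an empty string on ties, rather than picking arbitrarily, keeps the result independent of map iteration order.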
Frontend (TypeScript)
- New lib/log-component.ts owns two shared helpers — isLogSourced() gates the
secondary line to log-sourced issues only (ERR/KERN or .ndjson / dmesg.txt /
logs/* artifact paths) and effectiveComponent() enforces a single precedence
chain (structured → ruleset → null). Both source.ts and breakdown.ts import
from here so the Source cell and chips can never disagree.
- lib/source.ts now returns { primary, secondary? } with a compareSourceLabels
helper; old `A · B` concatenation gone.
- lib/component.ts shrunk to a bounded ordered ruleset (WSL, dxgk, NVIDIA,
LSB, udev + syslog fallback) with a hard "one rule per family" policy in
the header and a documented success condition ("delete this file when the
structured path covers the tail"). \b-anchored WSL/LSB rules tolerate
leading prefix tokens.
- New lib/breakdown.ts: pure buildComponentBreakdown + shouldShowBreakdown
helpers. Count-desc sort with alphabetical tiebreak, 6-chip cap with
+N-more overflow, "other" bucket for ruleset-miss so counts always sum to
the group total. overflowAriaLabel names every hidden entry for screen
readers.
- IssuesPage: SourceCell component for the two-line cell (three call sites
kept DRY); chip row above expanded members with aria-pressed toggle
filtering; "other" chip rendered dashed + italic so it reads as
second-class; parent group row's secondary is suppressed when the chip row
will enumerate multiple components (prevents showing sample's component
as if it spoke for the group).
- DashboardPage TopIssues: subline now primary + muted-opacity secondary
inline, not a compound string.
- IssueDetailPage: Category / Confidence / Severity rows removed from the
METADATA sidebar (already present as pills next to the title — sidebar
keeps Code / Collector / Fingerprint only).
- sampleLine is null-safe.
Tests
- Backend: +3 test files of coverage. TestNormalizeParity exhaustively
pins every regex in the copy to the collector; lookup/majority/tie/fallback
and malformed-NDJSON cases; store migration + backfill round-trip;
integration ingest asserts structured Component on a journal row and empty
on a dmesg row in the same archive. Go test: 84 passed.
- Frontend: 33 new cases across log-component, breakdown, component (6 new
family rules + a prefixed-WSL regression), source (scoping gate regression),
and sampleLine null safety. Vitest: 120 passed.
Explicitly not in this change
- No collector / archive-schema changes. The structured-component path is
implemented at the dashboard ingest boundary.
- No groupKey / fingerprint-semantics changes. Chips are an in-expansion
pivot only.
- Heuristic auto-inference (TF-IDF, clustering) deliberately rejected —
systemd's SYSLOG_IDENTIFIER is the auto-detection; the ruleset is a
bounded fallback for the non-journal tail.
- Track B polish (OVS facts, TopIssues pill alignment, stat-card hierarchy,
fingerprint copy, group counter wording) — deferred to its own PR.
…fingerprint copy affordance, finding/event wording
Final polish pass paired with the Track A log-digestibility landing.
B1 (OVS fact registry + dotted-key humaniser fallback) and B2 (TopIssues
pill stacking into a fixed-width left column) were already in place before
this commit; the items below close out the remaining gaps.
B3 — Stat-card visual hierarchy. Bumped the Critical tile's number from
text-2xl to text-3xl font-bold so it reads loudest; Warning stays
text-2xl/semibold, Info/Total stay text-xl/medium + muted foreground. Tile
dimensions are unchanged, so the four-card grid doesn't reflow.
B4 — Fingerprint copy affordance (per review hardening). Separated the
truncated hash from the copy action: the hash renders as a static
<span>, and a dedicated 6×6 icon button sits beside it with an
aria-label ("Copy fingerprint" / "Fingerprint copied"), visible focus
ring, and a Check/Copy icon swap on success. Users can no longer misread
the hash as a navigable control. copyToClipboard (already extracted in
lib/clipboard.ts) is reused from CopySummaryButton.
B5 — Group counter wording. Renamed "(N patterns · M events)" to
"(N findings · M events)" in both IssuesPage and DashboardPage TopIssues,
since group.count = members.length (distinct findings, each with its own
fingerprint) and group.occurrences sums the (Nx in ...) suffixes.
"Findings" matches CX support vocabulary better than Sentry-style
"patterns".
Verification: 120 Vitest cases green; backend untouched; pnpm build clean.
…itle, member messages, fingerprint label
Three concrete issues surfaced while reviewing the Nautilus archive in compact / narrow-viewport screenshots.
- Issues list: the group parent row was clipping "Error/Fail" to "Err..." because the inline count string "(N findings · M events)" shared a line-clamp-1 flex container with the title. Moved the count to its own muted second line: the title never fights the counter for horizontal space, both are readable, and the overall row height is unchanged since the subline was already present for non-singleton groups.
- Issues list: member-row messages under an expanded group were truncating at a common prefix (e.g. four rows all showed "misc dxg: dxgk:..." with no way to see -2 vs -22 vs -75). Bumped the compact subline to line-clamp-2 and the titleless fallback to line-clamp-3; non-compact views keep their original single-line clamp.
- Issue detail METADATA: FingerprintCopy was rendering its label as uppercase tracking-wide while the adjacent KV rows (Collector, Code) used mixed-case text-xs. Aligned to the KV style so the three labels in the sidebar read as one group.
… viewports
.top-issue-list was declared as `display: grid` with no `grid-template-columns`, so its single implicit column track was sized `auto` and its items kept the default `min-width: auto`. The track therefore expands to the child's widest unbreakable content and the `truncate` on the subline becomes a no-op. Long sublines like "System Logs · misc dxg: dxgk: dxgkio_is_feature_enabled: Ioctl failed: -75 (2x in logs/journal_errors.ndjson)" then pushed the whole card past its container, creating visible horizontal overflow on the Overview page at phone widths.
The fix is the same one already used by .artifact-browser-grid in the same stylesheet: `grid-template-columns: minmax(0, 1fr)` constrains the track to the available width and lets the inner truncate/line-clamp clipping take over.
… column into a single finding card
The old layout produced three problems across every issue variant:
- WHAT HAPPENED card echoed the page title (e.g. "Failed Systemd
Services" / "Firewall Posture: inactive" / "Error/Fail") right
underneath the page's own h2 — pure duplication.
- RECOMMENDED ACTION lived in its own card immediately below. Two
separate frames for one continuous thought ("here's what / here's
what to do") increased visual weight without adding information.
- Low-confidence log findings (no explicit action) left the left
column looking empty next to a 3-card sidebar.
Merged the two cards into one vertically-flowing "finding" card with
signposted sub-sections:
  [ hero stat strip ]   occurrence count + source link + group size
  What happened         prose description (with (Nx in …) tail stripped,
                        dropped entirely when it would just echo the
                        Evidence block below)
  Recommended action    inline green-accent row (not a full card)
  Evidence              raw matched lines in a mono block: the actual
                        string a CX agent would grep for
  How to investigate    three generic tips, only shown when no explicit
                        action is available (so low-conf Error/Fail is
                        never just a title)
The hero stat strip promotes (Nx occurrences) to a proper tabular-num
hero number, makes the source file a real link, and adds "N in this
group" when the pattern has siblings. Card renders these conditionally
so short-signal issues (single firewall fact, one failed service) don't
carry an empty band.
120 Vitest cases still green; issue-detail view has no dedicated
component test yet (earlier plan follow-up), but the behaviour is
covered by the data the triage collector emits.
…its body would be empty or duplicated
Two holes in the previous consolidation pass were visible in the
refreshed Error/Fail and Firewall screenshots:
- Error/Fail (new_issue_3.png): when the description was dropped
because it equalled the single-line Evidence below, the "WHAT
HAPPENED" section label still rendered above a now-empty body. The
card read "WHAT HAPPENED / [nothing] / EVIDENCE / misc dxg…".
- Firewall Posture (new_issue_2.png): the description still duplicated
the page subline ("Firewall inactive (ufw)" written twice within
~90px). The earlier dedup only compared description-to-evidence,
not description-to-subline.
Moved the entire "What happened" sub-section decision into one IIFE
that computes the effective body, hides the section label when the
body would be empty, and dedups against both the single-line evidence
block AND the page subline. For short-signal findings (firewall,
single-service failures) the card now collapses cleanly to just the
Recommended action; for log-match findings with only evidence, the
stat strip + Evidence + How-to-investigate read without a ghost label.
…ips, and column balance
Three remaining refinements called out on the previous review pass.
- Header separates "what this is" from "what I can do". Severity and
confidence pills (display — describe the finding) move inline with the
title; action buttons (Ack / Dismiss / Copy summary) stay right-aligned
on their own. The previous right cluster mixed interactive and display
controls, which blurred the affordance.
- "How to investigate" bullets are now context-aware. The source-file
bullet reads differently for .ndjson (mentions SYSLOG_IDENTIFIER /
_SYSTEMD_UNIT / timestamp fields as grep anchors) vs dmesg.txt
(suggests reading surrounding kernel lines) vs other files. The cadence
bullet uses the actual occurrence count ("fired N times — tight burst
vs steady cadence") when occCount > 1, or the sibling count when the
finding is a singleton with a populated group. The bullet list stays
capped at 3 items.
- Moved "Other entries in this group" from the sidebar to the main
column, immediately after the finding card. The sibling list is
content (clickable navigation to peer findings), not metadata; placing
it in the main column balances the layout when the sidebar was
outweighing a thin left column on low-signal issues. Sidebar now
carries only METADATA and RELATED ARTIFACTS — the two genuinely
meta-context cards.
120 Vitest cases still green.
…om sibling rows
Every row in "Other entries in this group" was starting with the same "Error/Fail:" prefix because the collector prefixes the raw line with the finding title. Since the card header already names the group, the prefix is redundant. Worse, it steals ~70px of horizontal space before the differentiating suffix (e.g. "dxgkio_is_feature_enabled: Ioctl failed: -22") can start. Uses the existing sampleLine() helper (same one that strips the prefix from the page subline) so the sibling rows now lead with their distinguishing detail.
…on a two-slot skeleton
Across the three issue-detail variants each view had been landing on a
different skeleton — ni_1 showed "What happened → Recommended action →
Evidence" (3 slots), ni_2 collapsed to "Recommended action" alone
(1 slot), ni_3 rendered "Evidence → How to investigate" (2 slots with
no "What happened" label at all). Same page, three layouts, no shared
mental model for the eye to lock onto.
Collapsed the card down to exactly two always-present slots:
1. "What happened": prose description + raw evidence in a mono
   block, or the raw issue.message as last-resort fallback. Evidence
   is now *inside* this slot rather than a sibling card; they answer
   the same question (here's the finding, here's the proof). The
   prose is skipped only when it's identical word-for-word to a
   single evidence line (duplication of text within the same slot,
   which reads as clutter); the slot itself always renders.
2. Next steps: always rendered with exactly one label depending on
   what we know:
   • "Recommended action": triage supplied finding.action
   • "How to investigate": no action, but we have a log source or
     evidence path, so the context-aware tips give useful guidance
   • "Suggested next steps": generic fallback when we have neither
   The context-aware tips and the generic fallback share one
   ul/li structure, so the visual weight is identical across cases.
Content-driven optional sections (stat strip, "Other entries in this
group", "Supporting findings") remain conditional on the data that
powers them — that variation is meaningful, not accidental. The
main card is now the fixed anchor every issue detail shares.
…tail
useIssueDetail was firing a cache-cold load on every first navigation to a new issue: the hook's data briefly became undefined, the component's isLoading branch rendered "Loading issue…", then the finding rendered. Going back was smooth only because the cache was already warm. Added `placeholderData: keepPreviousData` (same pattern useIssues already uses). The previously-viewed issue stays on screen while the next one fetches, so prev/next transitions are uniformly smooth in both directions.
Dashboard mvp
Ships the `gather-info` v3 schema bump and a new companion dashboard project that consumes its archives. 22 commits, ~31k insertions across 207 files under `customers/vm-troubleshooting/` and `customers/vm-troubleshooting-dashboard/` (new).
Collector (`customers/vm-troubleshooting/`)
- `edac`, `ipmi`, `pcie`, `thermal`, plus `hypervisor`, `ovs`. Schema v2 → v3 adds the required source / tag enum values (`ipmitool`, `dmidecode`, `ethtool`, `devlink`, `perfquery`, etc.); purely additive.
- HW-critical classifiers, XID catalog with a generator (`tools/update-xid-catalog.py`) and the upstream-sync shell helper.
- containerised-hypervisor detection, skipped-artifact placeholders, fingerprint uniqueness, journal NDJSON support, narrow-terminal banner fix.
- `v0.2.0` → `v0.2.1` progression is captured in commit history.
Dashboard (`customers/vm-troubleshooting-dashboard/`, new)
Go (`cmd/` + `internal/api`, `ingest`, `store` on SQLite) + React 19 (Vite, TanStack Query, Tailwind 4, Vitest). 58 Go tests, 87 frontend tests.
- (`skip_reasons_json`, `uptime_seconds`, `issue_state`, `component`). Never rejects unknown fields/enums (see `AGENTS.md` forward-compat contract).
- (title, code, category, collector), composite rank (severity → confidence → count → recency), threshold-coloured fact tiles, boot-time annotation, Ack / Dismiss workflow keyed by fingerprint (survives re-ingest), JSON artifact tree with severity markers, `#L42` line deep links, copy-summary / copy-fingerprint, prev/next with `j`/`k`.
- `journal_errors.ndjson` re-parsed at ingest to lift `SYSLOG_IDENTIFIER` into `IssueRecord.component`; bounded frontend regex fallback (`component.ts`: one rule per family, no-merge signal on growth). See `.claude-ngc/plans/woolly-rolling-mccarthy.md` for the extensibility story and the tactical-debt note on the duplicated normalizer.
- `bad`/`warn`/`unavailable` thresholds; dotted-key fallback so `system.iommu_enabled` resolves the same entry as the bare form.
- `Dockerfile`, `docker-compose.yml`, `Caddyfile.example`, `oauth2-proxy.cfg.example`, `DEPLOYMENT.md`.